Star Wars: The Data Awakens

Welcome to Cornell College!

Introductions

  • Professor Tyler George

  • Student Robin Gillette

Famous Quotes

  • Who has a favorite Star Wars quote?

  • Would you say that quote is positive or negative?

  • Famous quotes, according to The Hollywood Reporter

    • Luke, I am your father
    • Just kidding, that is not the real quote!

Text Analysis

  • A popular technique in Data Science is text analysis.

  • What do you think text analysis is (beyond analysis of text)?

Sentiment Analysis

  • Today, you will all be trying your hand at a type of text analysis called sentiment analysis.

  • Any educated guesses on what sentiment analysis is?

Activity

  • At each of your tables, you have:
    • Star Wars movie quotes
    • A sentiment lexicon: a dictionary that labels words by their sentiment.
      • This one is called Bing
      • Every word in it is labeled either positive or negative

Activity Instructions

  • Fill in the table with each word of your quote(s) on the left and its sentiment (positive or negative) on the right.

  • Count the positives and negatives and write your totals at the bottom.

  • Scan the QR code and enter your movie (IV, V, or VI) and the number of positive sentiment words and negative sentiment words in your quote(s).

    • If extra sheets are lying around, feel free to work on those quotes, too!

    • You can also respond at bit.ly/SWTDASent.
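
The hand tally above can also be sketched in R. This is a minimal illustration, not the code used in the session: the five-word lexicon below is a made-up stand-in for the Bing lexicon, and the variable names are assumptions.

```r
# Toy stand-in for the Bing lexicon (hypothetical subset, not the real list).
lexicon <- data.frame(
  word = c("faith", "lack", "disturbing", "hope", "fear"),
  sentiment = c("positive", "negative", "negative", "positive", "negative")
)

quote <- "I find your lack of faith disturbing."

# Lowercase the quote, strip punctuation, and split it into words.
words <- strsplit(tolower(gsub("[[:punct:]]", "", quote)), "\\s+")[[1]]

# Keep only the words found in the lexicon, then tally by sentiment.
matched <- merge(data.frame(word = words), lexicon, by = "word")
tally <- table(factor(matched$sentiment, levels = c("positive", "negative")))
print(tally)
```

Words that are not in the lexicon (such as "find" or "your") are simply dropped, just as they are left blank on the worksheet.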

Let’s Dive Deeper

  • What would we need to do to understand the sentiment of a character or a movie?

  • Each table has one computer connected to the TV at that table.

  • Each computer has a program called RStudio on the screen, which allows you to program in the language R.

  • R is a statistical programming language and one of the two most common languages in data science (Python is the other).

What you will need…

A row of a dataset runs left to right. A column of a dataset is vertical and runs top to bottom (think of the lettered columns in a Google Sheet).

  • filter: This function keeps only the rows with words spoken by “Vader”.

  • count: This function counts how many times each word appears.

  • group_by and slice_max: Together, these functions take the counts and keep the 5 most common words with positive sentiment and the 5 most common words with negative sentiment.
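
A rough sketch of how these functions might chain together with the dplyr package. The data frame below is made up for illustration, and its column names (character, word, sentiment) are assumptions; in a real analysis the sentiment labels would come from a lexicon such as Bing.

```r
library(dplyr)

# Hypothetical toy data: one row per spoken word, already tagged with a sentiment.
spoken <- data.frame(
  character = c("Vader", "Vader", "Vader", "Vader", "Luke", "Vader"),
  word      = c("power", "power", "destroy", "fail", "hope", "destroy"),
  sentiment = c("positive", "positive", "negative", "negative",
                "positive", "negative")
)

top_words <- spoken %>%
  filter(character == "Vader") %>%   # keep only Vader's rows
  count(word, sentiment) %>%         # count each word; the result column is named n
  group_by(sentiment) %>%            # handle positive and negative words separately
  slice_max(n, n = 5) %>%            # keep the 5 highest counts in each group
  ungroup()

print(top_words)
```

By default slice_max keeps ties, so a group can return more than 5 words when several share the same count.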

A More Advanced Analysis

  • Dennis Bakhuis scraped all of Wookieepedia's information on Star Wars.

  • He posted all of his work HERE.

  • One fantastic result is a network plot.

Data Science and Statistics at Cornell

  • Courses take advantage of the block plan, where we can learn and practice a concept in class.

  • Interactive labs in most of our major courses.

  • Competitions, conferences, socials and more.

Acknowledgments

  • Star Wars is owned by Lucasfilm. I do not have any rights to this information.

  • Tidy Text by Julia Silge and David Robinson

  • Kaggle Report by Xavier Vivancos García